Exploratory analysis using Xorbits over NYC taxi dataset¶

Xorbits is a powerful tool for exploring and analyzing large datasets. The NYC taxi dataset, which contains records of taxi rides in the city from 2009 to 2022, is a classic for demonstrating such tools. In this blog, we will use Xorbits to do some initial exploration of the NYC taxi dataset and get a sense of what kind of insights we might be able to gain from the data.

Software versions¶

  • Xorbits==0.1.0
  • plotly==5.11.0
In [ ]:
# Install dependencies
%pip install xorbits==0.1.0 plotly==5.11.0

Datasets¶

  • TLC Trip Record Data
  • NYC Taxi Zones

Download the taxi zone lookup table and the taxi zone geojson:

In [ ]:
%%bash
wget https://d37ci6vzurychx.cloudfront.net/misc/taxi+_zone_lookup.csv
wget https://data.cityofnewyork.us/api/geospatial/d3c5-ddgc\?method\=export\&format\=GeoJSON -O taxi_zones.geojson

You can start with a single month of yellow taxi trip records:

In [ ]:
%%bash
wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-12.parquet

You can also use a full year of records:

In [ ]:
%%bash
for i in {1..12}
do
    wget https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2021-$(printf "%02d" $i).parquet
done

Initialization¶

The first step is to initialize Xorbits:

In [ ]:
import xorbits

# Initialize Xorbits in the local environment.
xorbits.init()

Data loading¶

The second step is to load the data into an Xorbits DataFrame. This can be done with the read_parquet() function, which lets us specify the location of the parquet file(s) and any additional options to use while reading the data.

In the case of the NYC taxi dataset, here is an example of how we could do this using Xorbits:

In [5]:
import datetime
import json
import xorbits.pandas as pd

trips = pd.read_parquet("yellow_tripdata_2021-*.parquet")
# Keep only trips picked up in 2021; the monthly files occasionally
# contain stray records from adjacent months.
trips = trips[
    (trips['tpep_pickup_datetime'] >= datetime.datetime(2021, 1, 1))
    & (trips['tpep_pickup_datetime'] <= datetime.datetime(2021, 12, 31))
]

taxi_zones = pd.read_csv('taxi+_zone_lookup.csv')

with open('taxi_zones.geojson') as fd:
    geojson = json.load(fd)
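A note on the date bounds: the upper bound above is the midnight timestamp 2021-12-31 00:00:00, so nearly every pickup on December 31 is excluded (only rides at exactly midnight survive). If you want the whole last day, a half-open range is the usual idiom. A minimal pandas sketch with synthetic timestamps:

```python
import datetime
import pandas as pd

# An inclusive upper bound of 2021-12-31 00:00:00 drops every pickup
# later that day. A half-open range (>= start, < the day after the
# last day) keeps all of December 31.
pickups = pd.Series(pd.to_datetime([
    "2020-12-31 23:59",  # before the range
    "2021-06-15 12:00",  # inside the range
    "2021-12-31 18:00",  # would be dropped by the inclusive bound
]))
start = datetime.datetime(2021, 1, 1)
end = datetime.datetime(2022, 1, 1)  # exclusive
mask = (pickups >= start) & (pickups < end)
print(mask.tolist())  # [False, True, True]
```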

Once we have the data loaded into a DataFrame, we might want to get a sense of the overall structure of the data by looking at the number of rows and columns, the data types of each column, and the first few rows. We can do this using the shape and dtypes attributes and the head() method, respectively. (Because Xorbits evaluates lazily, the row count in shape may display as nan before the underlying computation has run.)

In [6]:
print(trips.shape)
print(trips.dtypes)
print(trips.head())
(nan, 19)
VendorID                          int64
tpep_pickup_datetime     datetime64[ns]
tpep_dropoff_datetime    datetime64[ns]
passenger_count                 float64
trip_distance                   float64
RatecodeID                      float64
store_and_fwd_flag               object
PULocationID                      int64
DOLocationID                      int64
payment_type                      int64
fare_amount                     float64
extra                           float64
mta_tax                         float64
tip_amount                      float64
tolls_amount                    float64
improvement_surcharge           float64
total_amount                    float64
congestion_surcharge            float64
airport_fee                     float64
dtype: object
   VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
0         1  2021-12-01 00:19:51   2021-12-01 00:37:01              1.0   
1         2  2021-12-01 00:29:07   2021-12-01 00:45:13              2.0   
2         1  2021-12-01 00:12:40   2021-12-01 00:27:17              0.0   
3         1  2021-12-01 00:10:18   2021-12-01 00:19:20              1.0   
4         1  2021-12-01 00:25:12   2021-12-01 00:39:07              1.0   

   trip_distance  RatecodeID store_and_fwd_flag  PULocationID  DOLocationID  \
0           9.30         1.0                  N           138           141   
1           2.76         1.0                  N           238            42   
2           3.40         1.0                  N           239            74   
3           1.30         1.0                  N           148            87   
4           3.10         1.0                  N           231           246   

   payment_type  fare_amount  extra  mta_tax  tip_amount  tolls_amount  \
0             1         26.5   4.25      0.5        7.60          6.55   
1             2         13.0   0.50      0.5        0.00          0.00   
2             1         13.5   3.00      0.5        2.00          0.00   
3             1          6.5   3.00      0.5        2.05          0.00   
4             1         12.5   3.00      0.5        2.50          0.00   

   improvement_surcharge  total_amount  congestion_surcharge  airport_fee  
0                    0.3         45.70                   2.5         1.25  
1                    0.3         16.80                   2.5         0.00  
2                    0.3         19.30                   2.5         0.00  
3                    0.3         12.35                   2.5         0.00  
4                    0.3         18.80                   2.5         0.00  

We can also use the describe() method to get a summary of the statistical properties of each numerical column in the dataset.

In [7]:
print(trips.describe())
           VendorID  passenger_count  trip_distance    RatecodeID  \
count  3.143916e+06     3.043266e+06   3.143916e+06  3.043266e+06   
mean   1.701623e+00     1.440766e+00   7.190938e+00  1.176653e+00   
std    4.912976e-01     1.016665e+00   7.283667e+02  3.266358e+00   
min    1.000000e+00     0.000000e+00   0.000000e+00  1.000000e+00   
25%    1.000000e+00     1.000000e+00   1.100000e+00  1.000000e+00   
50%    2.000000e+00     1.000000e+00   1.840000e+00  1.000000e+00   
75%    2.000000e+00     2.000000e+00   3.430000e+00  1.000000e+00   
max    6.000000e+00     9.000000e+00   3.350937e+05  9.900000e+01   

       PULocationID  DOLocationID  payment_type   fare_amount         extra  \
count  3.143916e+06  3.143916e+06  3.143916e+06  3.143916e+06  3.143916e+06   
mean   1.647513e+02  1.625591e+02  1.213615e+00  1.429232e+01  1.058308e+00   
std    6.571399e+01  7.032944e+01  5.166679e-01  1.361084e+01  1.271117e+00   
min    1.000000e+00  1.000000e+00  0.000000e+00 -6.060000e+02 -4.500000e+00   
25%    1.320000e+02  1.130000e+02  1.000000e+00  7.000000e+00  0.000000e+00   
50%    1.620000e+02  1.620000e+02  1.000000e+00  1.000000e+01  5.000000e-01   
75%    2.340000e+02  2.340000e+02  1.000000e+00  1.600000e+01  2.500000e+00   
max    2.650000e+02  2.650000e+02  5.000000e+00  3.009000e+03  3.050000e+01   

            mta_tax    tip_amount  tolls_amount  improvement_surcharge  \
count  3.143916e+06  3.143916e+06  3.143916e+06           3.143916e+06   
mean   4.903896e-01  2.596015e+00  5.148399e-01           2.968758e-01   
std    8.476900e-02  3.128478e+00  2.021245e+00           4.275566e-02   
min   -5.500000e-01 -9.800000e+01 -5.100000e+01          -3.000000e-01   
25%    5.000000e-01  0.000000e+00  0.000000e+00           3.000000e-01   
50%    5.000000e-01  2.060000e+00  0.000000e+00           3.000000e-01   
75%    5.000000e-01  3.260000e+00  0.000000e+00           3.000000e-01   
max    3.300000e+00  4.140000e+02  8.000700e+02           3.000000e-01   

       total_amount  congestion_surcharge   airport_fee  
count  3.143916e+06          3.043266e+06  3.043266e+06  
mean   2.090387e+01          2.301929e+00  8.583418e-02  
std    1.689414e+01          7.132180e-01  3.183060e-01  
min   -6.068000e+02         -2.500000e+00 -1.250000e+00  
25%    1.184000e+01          2.500000e+00  0.000000e+00  
50%    1.580000e+01          2.500000e+00  0.000000e+00  
75%    2.230000e+01          2.500000e+00  0.000000e+00  
max    3.012300e+03          2.500000e+00  1.250000e+00  

Time series analysis¶

One way to analyze the NYC taxi dataset is to look at how the number of rides varies over time. We can do this by creating a new column in the DataFrame that represents the pick-up date of each ride, and then using the groupby method to group the data by day and compute the count of rides for each group:

In [8]:
trips['PU_date'] = trips['tpep_pickup_datetime'].dt.date
count = trips.groupby('PU_date', as_index=False).agg(count=('VendorID', 'count'))
print(count)
       PU_date   count
0   2021-11-30      54
1   2021-12-01  121986
2   2021-12-02  128647
3   2021-12-03  136231
4   2021-12-04  132392
5   2021-12-05  109654
6   2021-12-06  112665
7   2021-12-07  126649
8   2021-12-08  131662
9   2021-12-09  134724
10  2021-12-10  133497
11  2021-12-11  130247
12  2021-12-12  107167
13  2021-12-13  111299
14  2021-12-14  120218
15  2021-12-15  125048
16  2021-12-16  126270
17  2021-12-17  124147
18  2021-12-18  110279
19  2021-12-19   91200
20  2021-12-20   94213
21  2021-12-21   93093
22  2021-12-22   91699
23  2021-12-23   87850
24  2021-12-24   69619
25  2021-12-25   40807
26  2021-12-26   54705
27  2021-12-27   69897
28  2021-12-28   75037
29  2021-12-29   74989
30  2021-12-30   77968
31  2021-12-31       3
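The same pattern works at coarser granularity: dt.to_period("M") buckets timestamps by month rather than by day. A minimal pandas sketch on synthetic pickups (the column names follow the dataset; the values are made up):

```python
import pandas as pd

# Group at monthly granularity: dt.to_period("M") maps each timestamp
# to its calendar month.
df = pd.DataFrame({
    "tpep_pickup_datetime": pd.to_datetime(
        ["2021-11-30 08:00", "2021-12-01 09:00", "2021-12-15 10:00"]
    ),
    "VendorID": [1, 2, 1],
})
df["PU_month"] = df["tpep_pickup_datetime"].dt.to_period("M")
monthly = df.groupby("PU_month", as_index=False).agg(count=("VendorID", "count"))
print(monthly)  # one row each for 2021-11 and 2021-12
```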

We can then use a library like plotly to visualize the time series data:

In [9]:
import plotly.express as px

b = px.bar(count.to_pandas(), x='PU_date', y='count')
b.show()

Spatial analysis¶

Another way to analyze the NYC taxi dataset is to look at patterns in the spatial distribution of rides. Taking Manhattan as an example, we first filter the DataFrame by pick-up location ID:

In [10]:
manhattan_zones = taxi_zones[taxi_zones['Borough'] == 'Manhattan']['LocationID']
manhattan_trips = trips[trips['PULocationID'].isin(manhattan_zones)]
print(manhattan_trips)
         VendorID tpep_pickup_datetime tpep_dropoff_datetime  passenger_count  \
1               2  2021-12-01 00:29:07   2021-12-01 00:45:13              2.0   
2               1  2021-12-01 00:12:40   2021-12-01 00:27:17              0.0   
3               1  2021-12-01 00:10:18   2021-12-01 00:19:20              1.0   
4               1  2021-12-01 00:25:12   2021-12-01 00:39:07              1.0   
6               2  2021-12-01 00:00:08   2021-12-01 00:18:27              2.0   
...           ...                  ...                   ...              ...   
3212643         2  2021-12-30 23:04:37   2021-12-30 23:22:59              NaN   
3212644         1  2021-12-30 23:31:58   2021-12-30 23:37:58              NaN   
3212645         1  2021-12-30 23:53:54   2021-12-31 00:03:17              NaN   
3212646         2  2021-12-30 23:28:00   2021-12-30 23:43:00              NaN   
3212672         2  2021-12-31 00:00:00   2021-12-31 00:08:00              NaN   

         trip_distance  RatecodeID store_and_fwd_flag  PULocationID  \
1                 2.76         1.0                  N           238   
2                 3.40         1.0                  N           239   
3                 1.30         1.0                  N           148   
4                 3.10         1.0                  N           231   
6                 5.18         1.0                  N           114   
...                ...         ...                ...           ...   
3212643           7.01         NaN               None            42   
3212644           1.30         NaN               None           164   
3212645           2.30         NaN               None           263   
3212646           4.46         NaN               None           246   
3212672           1.02         NaN               None           163   

         DOLocationID  payment_type  fare_amount  extra  mta_tax  tip_amount  \
1                  42             2        13.00    0.5      0.5        0.00   
2                  74             1        13.50    3.0      0.5        2.00   
3                  87             1         6.50    3.0      0.5        2.05   
4                 246             1        12.50    3.0      0.5        2.50   
6                 239             1        17.50    0.5      0.5        4.26   
...               ...           ...          ...    ...      ...         ...   
3212643            70             0        32.15    0.0      0.5        0.00   
3212644            90             0         6.50    0.5      0.5        1.03   
3212645           151             0         9.50    0.5      0.5        1.33   
3212646            87             0        17.34    0.0      0.5        4.42   
3212672           229             0        13.20    0.0      0.5        0.00   

         tolls_amount  improvement_surcharge  total_amount  \
1                0.00                    0.3         16.80   
2                0.00                    0.3         19.30   
3                0.00                    0.3         12.35   
4                0.00                    0.3         18.80   
6                0.00                    0.3         25.56   
...               ...                    ...           ...   
3212643          6.55                    0.3         39.50   
3212644          0.00                    0.3         11.33   
3212645          0.00                    0.3         14.63   
3212646          0.00                    0.3         25.06   
3212672          0.00                    0.3         16.50   

         congestion_surcharge  airport_fee     PU_date  
1                         2.5          0.0  2021-12-01  
2                         2.5          0.0  2021-12-01  
3                         2.5          0.0  2021-12-01  
4                         2.5          0.0  2021-12-01  
6                         2.5          0.0  2021-12-01  
...                       ...          ...         ...  
3212643                   NaN          NaN  2021-12-30  
3212644                   NaN          NaN  2021-12-30  
3212645                   NaN          NaN  2021-12-30  
3212646                   NaN          NaN  2021-12-30  
3212672                   NaN          NaN  2021-12-31  

[2827015 rows x 20 columns]
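As an aside, the isin filter could equally be written as a join: merging the zone lookup onto the trips gives each row its borough name directly, which is handy when several borough-level columns are needed. A minimal pandas sketch on synthetic frames (the LocationID values are made up):

```python
import pandas as pd

# Merge the zone lookup onto the trips, then filter on the borough name.
taxi_zones = pd.DataFrame({
    "LocationID": [4, 100, 264],
    "Borough": ["Manhattan", "Manhattan", "Unknown"],
})
trips = pd.DataFrame({"PULocationID": [4, 264, 100], "VendorID": [1, 2, 2]})
merged = trips.merge(
    taxi_zones, left_on="PULocationID", right_on="LocationID", how="left"
)
manhattan_trips = merged[merged["Borough"] == "Manhattan"]
print(len(manhattan_trips))  # 2
```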

Then we use the groupby method to group the data by pick-up location ID and compute the count of rides for each group:

In [11]:
gb_pu_location = manhattan_trips.groupby(['PULocationID'], as_index=False).agg(count=('VendorID', 'count')).to_pandas()

We can then use a library like plotly to visualize the spatial distribution of rides:

In [12]:
m = px.choropleth(
    gb_pu_location,
    geojson=geojson,
    locations='PULocationID',
    featureidkey='properties.location_id',
    color='count',
    color_continuous_scale="Viridis",
    range_color=(0, gb_pu_location['count'].max()),
    labels={'count':'trips count'}
)
m.update_geos(fitbounds="locations", visible=False)
m.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
m.show()